global learning rate
e1054bf2d703bca1e8fe101d3ac5efcd-AuthorFeedback.pdf
We thank the reviewers for their time, effort, and helpful feedback. We address individual comments below. The training loss in all of our examples can be written in the form of lines 17-18. We will include these additional references and add a proof to the appendix. Consider equation (6) when the assumptions of Section 2.1 apply, i.e., Our test can be considered as a more general version of testing whether the loss has reached a constant value.
Cumulative Learning Rate Adaptation: Revisiting Path-Based Schedules for SGD and Adam
Atamna, Asma, Maus, Tom, Kievelitz, Fabian, Glasmachers, Tobias
The learning rate is a crucial hyperparameter in deep learning, with its ideal value depending on the problem and potentially changing during training. In this paper, we investigate the practical utility of adaptive learning rate mechanisms that adjust step sizes dynamically in response to the loss landscape. We revisit a cumulative path-based adaptation scheme proposed in 2017, which adjusts the learning rate based on the discrepancy between the observed path length, computed as a time-discounted sum of normalized gradient steps, and the expected length of a random walk. While the original approach offers a compelling intuition, we show that its adaptation mechanism for Adam is conceptually inconsistent due to the optimizer's internal preconditioning. We propose a corrected variant that better reflects Adam's update dynamics. To assess the practical value of online learning rate adaptation, we benchmark SGD and Adam, with and without cumulative adaptation, and compare them to a recent alternative method. Our results aim to clarify when and why such adaptive strategies offer practical benefits.
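To make the path-length idea in the abstract above concrete, here is a minimal sketch of a cumulative step-size rule in the style of cumulative step-size adaptation. It is an illustration under stated assumptions, not the authors' exact update: the function name `cumulative_lr_update`, the discount `c`, and the damping `d` are hypothetical, and the sketch uses the raw normalized gradient, so it does not include the Adam-specific correction the paper proposes. The normalization `sqrt(c * (2 - c))` is chosen so that uncorrelated unit steps give an expected squared path length of 1, the random-walk reference.

```python
import math

def cumulative_lr_update(path, grad, lr, c=0.1, d=0.1):
    """One step of a path-based learning rate rule (illustrative sketch).

    path: time-discounted sum of normalized gradient directions
    c:    discount factor (assumed value)
    d:    damping of the learning rate change (assumed value)
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return path, lr  # no direction information, leave everything unchanged
    unit = [g / norm for g in grad]
    coef = math.sqrt(c * (2.0 - c))
    new_path = [(1.0 - c) * p + coef * u for p, u in zip(path, unit)]
    sq_len = sum(p * p for p in new_path)
    # Longer than a random walk (sq_len > 1): steps are correlated, so increase lr.
    # Shorter (sq_len < 1): steps cancel each other, so decrease lr.
    new_lr = lr * math.exp(d * (sq_len - 1.0))
    return new_path, new_lr

def run(grads, lr=0.01):
    """Apply the rule to a sequence of gradients, starting from a zero path."""
    path = [0.0] * len(grads[0])
    for g in grads:
        path, lr = cumulative_lr_update(path, g, lr)
    return lr

lr_aligned = run([[1.0, 2.0]] * 20)                     # same direction every step
lr_oscillating = run([[1.0, 2.0], [-1.0, -2.0]] * 10)   # direction flips every step
```

In this sketch, consistently aligned gradients end with a larger learning rate than the initial one, while oscillating gradients shrink it. Because the path starts at zero, the first few updates reduce the rate even for aligned gradients.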
FedDuA: Doubly Adaptive Federated Learning
Takakura, Shokichi, Liew, Seng Pei, Hasegawa, Satoshi
Federated learning is a distributed learning framework where clients collaboratively train a global model without sharing their raw data. FedAvg is a popular algorithm for federated learning, but it often suffers from slow convergence due to the heterogeneity of local datasets and anisotropy in the parameter space. In this work, we formalize the central server optimization procedure through the lens of mirror descent and propose a novel framework, called FedDuA, which adaptively selects the global learning rate based on both inter-client and coordinate-wise heterogeneity in the local updates. We prove that our proposed doubly adaptive step-size rule is minimax optimal and provide a convergence analysis for convex objectives. Although the proposed method does not require additional communication or computational cost on clients, extensive numerical experiments show that our proposed framework outperforms baselines in various settings and is robust to the choice of hyperparameters.
Contouring learning rate to optimize neural nets
Check out Siddha Ganju's talk on embedded deep learning at the Artificial Intelligence Conference in San Francisco, Sept. 17-20, 2017. Learning rate is the rate at which the accumulation of information in a neural network progresses over time. The learning rate determines how quickly (and whether at all) the network reaches the optimum, the most conducive location in the network for the specific output desired. In plain Stochastic Gradient Descent (SGD), a single global learning rate is used, so the learning rate is independent of the shape of the error gradient. However, there are many modifications to the original SGD update rule that relate the learning rate to the magnitude and orientation of the error gradient.
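The contrast between a global learning rate and a gradient-dependent one can be sketched as follows. This is an illustrative example, not code from the article: plain SGD is compared with an AdaGrad-style per-coordinate rule, chosen here as one well-known modification, and the function names and test values are assumptions.

```python
import math

def sgd_step(w, grad, lr):
    """Plain SGD: one global learning rate scales every coordinate equally."""
    return [wi - lr * gi for wi, gi in zip(w, grad)]

def adagrad_step(w, grad, accum, lr, eps=1e-8):
    """AdaGrad-style step: each coordinate's effective rate shrinks
    with the accumulated squared gradients of that coordinate."""
    new_accum = [a + g * g for a, g in zip(accum, grad)]
    new_w = [wi - lr * gi / (math.sqrt(a) + eps)
             for wi, gi, a in zip(w, grad, new_accum)]
    return new_w, new_accum

# Hypothetical gradient with very different magnitudes per coordinate.
w0 = [1.0, 1.0]
g0 = [1.0, 100.0]

w_sgd = sgd_step(w0, g0, lr=0.01)
w_ada, acc = adagrad_step(w0, g0, accum=[0.0, 0.0], lr=0.01)
```

With the global rate, the second coordinate moves 100 times further than the first. With the per-coordinate rule, the first step moves both coordinates by roughly `lr`, regardless of gradient magnitude.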
[1606.04474] Learning to learn by gradient descent by gradient descent • /r/MachineLearning
One thing I'm not sure about is how fair their comparison is. By that I mean that they fix the global learning rate for the "hand designed" algos and choose it by grid search. However, we know well that in most problems we can start with a larger learning rate and decay it over time once the loss plateaus. The issue with not considering that is that the best fixed global learning rate for the whole run would probably be one that is very slow but eventually outperforms faster ones. Nevertheless, this is interesting work, although I'm still quite skeptical that such optimizers will generalize well to large models.
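For reference, the "start large, decay when it plateaus" baseline mentioned in the comment could look roughly like the sketch below. It is a hypothetical illustration, not something from the paper or the thread: the function name `decay_on_plateau` and the values of `factor`, `patience`, and `min_delta` are assumptions.

```python
def decay_on_plateau(losses, lr0=0.1, factor=0.5, patience=2, min_delta=1e-4):
    """Return the learning rate used at each epoch, given the observed losses.

    The rate is multiplied by `factor` whenever the loss has failed to improve
    on the best value by more than `min_delta` for `patience` epochs in a row.
    """
    lr = lr0
    best = float("inf")
    stale = 0  # consecutive epochs without improvement
    schedule = []
    for loss in losses:
        schedule.append(lr)
        if loss < best - min_delta:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                lr *= factor
                stale = 0
    return schedule

# Made-up loss curve that plateaus twice.
schedule = decay_on_plateau([1.0, 0.5, 0.5, 0.5, 0.4, 0.4, 0.4])
```

Here the rate stays at 0.1 for the first four epochs and drops to 0.05 after two epochs without improvement.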
Less Regret via Online Conditioning
Streeter, Matthew, McMahan, H. Brendan
In the past few years, online algorithms have emerged as state-of-the-art techniques for solving large-scale machine learning problems [2, 13, 16]. In addition to their simplicity and generality, online algorithms are natural choices for problems where new data is constantly arriving and rapid adaptation is important. Compared to the study of convex optimization in the batch (offline) setting, the study of online convex optimization is relatively new. In light of this, it is not surprising that performance-improving techniques that are well known and widely used in the batch setting do not yet have online analogues. In particular, convergence rates in the batch setting can often be dramatically improved through the use of preconditioning. Yet, the online convex optimization literature provides no comparable method for improving regret (the online analogue of convergence rates).
Tempering Backpropagation Networks: Not All Weights are Created Equal
Schraudolph, Nicol N., Sejnowski, Terrence J.
Backpropagation learning algorithms typically collapse the network's structure into a single vector of weight parameters to be optimized. We suggest that their performance may be improved by utilizing the structural information instead of discarding it, and introduce a framework for "tempering" each weight accordingly. In the tempering model, activation and error signals are treated as approximately independent random variables. The characteristic scale of weight changes is then matched to that of the residuals, allowing structural properties such as a node's fan-in and fan-out to affect the local learning rate and backpropagated error. The model also permits calculation of an upper bound on the global learning rate for batch updates, which in turn leads to different update rules for bias vs. non-bias weights. This approach yields hitherto unparalleled performance on the family relations benchmark, a deep multi-layer network: for both batch learning with momentum and the delta-bar-delta algorithm, convergence at the optimal learning rate is sped up by more than an order of magnitude.